1,811 research outputs found

Deep Ordinal Reinforcement Learning

Author: C Wirth
CJ Watkins
RS Sutton
V Mnih
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 11/07/2019

Reinforcement learning usually makes use of numerical rewards, which have nice properties but also come with drawbacks and difficulties. Using rewards on an ordinal scale (ordinal rewards) is an alternative to numerical rewards that has received more attention in recent years. In this paper, a general approach to adapting reinforcement learning problems to the use of ordinal rewards is presented and motivated. We show how to convert common reinforcement learning algorithms to an ordinal variation by the example of Q-learning and introduce Ordinal Deep Q-Networks, which adapt deep reinforcement learning to ordinal rewards. Additionally, we run evaluations on problems provided by the OpenAI Gym framework, showing that our ordinal variants exhibit a performance that is comparable to the numerical variations for a number of problems. We also give first evidence that our ordinal variant is able to produce better results for problems with less engineered and simpler-to-design reward signals. Comment: replaced figures for better visibility, added GitHub repository, more details about source of experimental results, updated target value calculation for standard and ordinal Deep Q-Networks

arXiv.org e-Print Archive

Crossref

Water resources management in a homogenizing world: Averting the Growth and Underinvestment trajectory

Author: Hjorth P
Huckins CJ
Madani K
Mirchi A
Watkins DW
Publication venue: 'Wiley'
Publication date: 01/01/2014

Biotic homogenization, a de facto symptom of a global biodiversity crisis, underscores the urgency of reforming water resources management to focus on the health and viability of ecosystems. Global population and economic growth, coupled with inadequate investment in maintenance of ecological systems, threaten to degrade environmental integrity and ecosystem services that support the global socioeconomic system, indicative of a system governed by the Growth and Underinvestment (G&U) archetype. Water resources management is linked to biotic homogenization and degradation of system integrity through alteration of water systems, ecosystem dynamics, and composition of the biota. Consistent with the G&U archetype, water resources planning primarily treats ecological considerations as exogenous constraints rather than integral, dynamic, and responsive parts of the system. It is essential that the ecological considerations be made objectives of water resources development plans to facilitate the analysis of feedbacks and potential trade-offs between socioeconomic gains and ecological losses. We call for expediting a shift to ecosystem-based management of water resources, which requires a better understanding of the dynamics and links between water resources management actions, ecological side-effects, and associated long-term ramifications for sustainability. To address existing knowledge gaps, models that include dynamics and estimated thresholds for regime shifts or ecosystem degradation need to be developed. Policy levers for implementation of ecosystem-based water resources management include shifting away from growth-oriented supply management, better demand management, increased public awareness, and institutional reform that promotes adaptive and transdisciplinary management approaches

Lund University Publications

Michigan Technological University

Spiral - Imperial College Digital Repository

Learning Best Response Strategies for Agents in Ad Exchanges

Author: CJ Watkins
EL Kaplan
M Schain
P Auer
S Albrecht
S Muthukrishnan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 10/02/2019

Ad exchanges are widely used in platforms for online display advertising. Autonomous agents operating in these exchanges must learn policies for interacting profitably with a diverse, continually changing, but unknown market. We consider this problem from the perspective of a publisher, strategically interacting with an advertiser through a posted price mechanism. The learning problem for this agent is made difficult by the fact that information is censored, i.e., the publisher knows if an impression is sold but no other quantitative information. We address this problem using the Harsanyi-Bellman Ad Hoc Coordination (HBA) algorithm, which conceptualises this interaction in terms of a Stochastic Bayesian Game and arrives at optimal actions by best responding with respect to probabilistic beliefs maintained over a candidate set of opponent behaviour profiles. We adapt and apply HBA to the censored information setting of ad exchanges. Also, addressing the case of stochastic opponents, we devise a strategy based on a Kaplan-Meier estimator for opponent modelling. We evaluate the proposed method using simulations wherein we show that HBA-KM achieves substantially better competitive ratio and lower variance of return than baselines, including a Q-learning agent and a UCB-based online learning agent, and comparable to the offline optimal algorithm

arXiv.org e-Print Archive

Crossref

Performance Enhancement of Deep Reinforcement Learning Networks using Feature Extraction

Author: CJ Watkins
D Silver
G Tesauro
GE Hinton
MG Bellemare
R Bellman
V Mnih
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

The combination of Deep Learning and Reinforcement Learning, termed Deep Reinforcement Learning Networks (DRLN), offers the possibility of using a Deep Learning Neural Network to produce an approximate Reinforcement Learning value table that allows extraction of features from neurons in the hidden layers of the network. This paper presents a two-stage technique for training a DRLN on features extracted from a DRLN trained on an identical problem, via the implementation of the Q-Learning algorithm, using TensorFlow. The results show that the extraction of features from the hidden layers of the Deep Q-Network improves the learning process of the agent (4.58 times faster and better) and proves the existence of encoded information about the environment which can be used to select the best action. The research contributes preliminary work in an ongoing research project in modeling features extracted from DRLNs.

City Research Online

Crossref

Pseudorehearsal in value function approximation

Author: A Robins
B Baddeley
CJ Watkins
J Gama
JL McClelland
JN Tsitsiklis
KP Murphy
M Frean
M Hattori
M McCloskey
R Coop
R Ratcliff
RJ Williams
RM French
RS Sutton
S Adam
Publication date: 21/03/2017

Catastrophic forgetting is of special importance in reinforcement learning, as the data distribution is generally non-stationary over time. We study and compare several pseudorehearsal approaches for Q-learning with function approximation in a pole balancing task. We have found that pseudorehearsal seems to assist learning even in such very simple problems, given proper initialization of the rehearsal parameters

arXiv.org e-Print Archive

Crossref

Identifying Critical States by the Action-Based Variance of Expected Return

Author: CJ Watkins
G Liu
IH Witten
M Stolle
MG Bellemare
SJ Kazemitabar
V Mnih
Y Kuniyoshi
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 08/11/2020

The balance of exploration and exploitation plays a crucial role in accelerating reinforcement learning (RL). To deploy an RL agent in human society, its explainability is also essential. However, basic RL approaches have difficulties in deciding when to choose exploitation as well as in extracting useful points for a brief explanation of its operation. One reason for the difficulties is that these approaches treat all states the same way. Here, we show that identifying critical states and treating them specially is commonly beneficial to both problems. These critical states are the states at which the action selection changes the potential of success and failure substantially. We propose to identify the critical states using the variance in the Q-function for the actions and to perform exploitation with high probability on the identified states. These simple methods accelerate RL in a grid world with cliffs and two baseline tasks of deep RL. Our results also demonstrate that the identified critical states are intuitively interpretable regarding the crucial nature of the action selection. Furthermore, our analysis of the relationship between the timing of the identification of especially critical states and the rapid progress of learning suggests there are a few especially critical states that have important information for accelerating RL rapidly. Comment: 12 pages, 6 figures
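
The idea in this abstract, flagging states where the Q-values differ strongly across actions and exploiting more often there, can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the function names, the fixed `threshold`, and the two epsilon values are assumptions chosen for illustration only.

```python
import numpy as np

def action_variance(q_values):
    """Variance of Q(s, a) across actions, one value per state (row)."""
    return np.var(np.asarray(q_values, dtype=float), axis=1)

def critical_states(q_values, threshold):
    """Indices of states whose action-wise Q variance exceeds `threshold`.

    `threshold` is a hypothetical tuning parameter, not a value from the paper.
    """
    return np.flatnonzero(action_variance(q_values) > threshold)

def select_action(q_row, is_critical, rng, epsilon=0.3, epsilon_critical=0.01):
    """Epsilon-greedy choice that explores less in critical states (assumed scheme)."""
    eps = epsilon_critical if is_critical else epsilon
    if rng.random() < eps:
        return int(rng.integers(len(q_row)))  # explore: uniform random action
    return int(np.argmax(q_row))              # exploit: greedy action

# Toy Q-table: three states, three actions. Only state 1 has widely spread values.
q = np.array([[1.0, 1.1, 0.9],
              [5.0, -5.0, 0.0],
              [0.2, 0.2, 0.2]])
crit = critical_states(q, threshold=1.0)
rng = np.random.default_rng(0)
greedy = select_action(q[1], is_critical=True, rng=rng, epsilon_critical=0.0)
```

In the toy table only the second state is flagged, because choosing between its actions changes the expected return far more than in the other two states.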

arXiv.org e-Print Archive

Crossref

P17-05. Dealing with HIV-1 diversity

Author: Borthwick N
Bridgeman A
Colloca S
Hanke T
Liljestrom P
Melief CJ
Nicosia A
Quakkelaar ED
Rosario M
Watkins D
Publication venue: BioMed Central
Publication date: 01/01/2009

Crossref

Springer - Publisher Connector

PubMed Central

Oxford University Research Archive

Adherence and persistence to direct oral anticoagulants in atrial fibrillation: a population-based study

Author: Antoniou S
Banerjee A
Benedetto V
Burnell J
Gichuru P
Marshall T
Ryan R
Schilling RJ
Strain WD
Sutton CJ
Watkins C
Publication date: 24/12/2019

Background: Despite simpler regimens than vitamin K antagonists (VKAs) for stroke prevention in atrial fibrillation (AF), adherence (taking drugs as prescribed) and persistence (continuation of drugs) to direct oral anticoagulants are suboptimal, yet understudied in electronic health records (EHRs). Objective: We investigated (1) time trends at individual and system levels, and (2) the risk factors for and associations between adherence and persistence. Methods: In UK primary care EHR (The Health Information Network 2011–2016), we investigated adherence and persistence at 1 year for oral anticoagulants (OACs) in adults with incident AF. Baseline characteristics were analysed by OAC and adherence/persistence status. Risk factors for non-adherence and non-persistence were assessed using Cox and logistic regression. Patterns of adherence and persistence were analysed. Results: Among 36 652 individuals with incident AF, cardiovascular comorbidities (median CHA2DS2VASc [Congestive heart failure, Hypertension, Age≥75 years, Diabetes mellitus, Stroke, Vascular disease, Age 65-74 years, Sex category] 3) and polypharmacy (median number of drugs 6) were common. Adherence was 55.2% (95% CI 54.6 to 55.7), 51.2% (95% CI 50.6 to 51.8), 66.5% (95% CI 63.7 to 69.2), 63.1% (95% CI 61.8 to 64.4) and 64.7% (95% CI 63.2 to 66.1) for all OACs, VKA, dabigatran, rivaroxaban and apixaban. One-year persistence was 65.9% (95% CI 65.4 to 66.5), 63.4% (95% CI 62.8 to 64.0), 61.4% (95% CI 58.3 to 64.2), 72.3% (95% CI 70.9 to 73.7) and 78.7% (95% CI 77.1 to 80.1) for all OACs, VKA, dabigatran, rivaroxaban and apixaban. Risk of non-adherence and non-persistence increased over time at individual and system levels. Increasing comorbidity was associated with reduced risk of non-adherence and non-persistence across all OACs. Overall rates of ‘primary non-adherence’ (stopping after first prescription), ‘non-adherent non-persistence’ and ‘persistent adherence’ were 3.5%, 26.5% and 40.2%, differing across OACs. Conclusions: Adherence and persistence to OACs are low at 1 year with heterogeneity across drugs and over time at individual and system levels. Better understanding of contributory factors will inform interventions to improve adherence and persistence across OACs in individuals and populations.

UCL Discovery

Learning from Monte Carlo Rollouts with Opponent Models for Playing Tron

Author: AL Samuel
CJ Watkins
D Silver
G Tesauro
J Baxter
J Schmidhuber
L Kocsis
M van Otterlo
RS Sutton
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 30/12/2018

This paper describes a novel reinforcement learning system for learning to play the game of Tron. The system combines Q-learning, multi-layer perceptrons, vision grids, opponent modelling, and Monte Carlo rollouts in a novel way. By learning an opponent model, Monte Carlo rollouts can be effectively applied to generate state trajectories for all possible actions, from which improved action estimates can be computed. This extends experience replay by making it possible to update the state-action values of all actions in a given game state simultaneously. The results show that experience replay that updates the Q-values of all actions simultaneously strongly outperforms conventional experience replay, which only updates the Q-value of the performed action. The results also show that using short or long rollout horizons during training leads to similarly good performance against two fixed opponents.
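
The contrast drawn in this abstract between updating one action's value and updating all actions' values at once can be sketched as follows. This is only an illustration under stated assumptions: it uses a tabular Q-array rather than the multi-layer perceptrons the paper describes, and the function names, the learning rate `lr`, and the example `targets` (standing in for rollout-based estimates) are hypothetical.

```python
import numpy as np

def update_single_action(q_table, state, action, target, lr=0.1):
    """Conventional update: move only Q(state, action) toward its target."""
    q_table[state, action] += lr * (target - q_table[state, action])
    return q_table

def update_all_actions(q_table, state, targets, lr=0.1):
    """Move Q(state, a) toward a target for every action a in one step.

    `targets` holds one estimate per action, e.g. obtained from rollouts
    (an assumption here; the estimates are supplied by the caller).
    """
    targets = np.asarray(targets, dtype=float)
    q_table[state] += lr * (targets - q_table[state])
    return q_table

# Two states, three actions, all values initialised to zero.
q_single = np.zeros((2, 3))
q_all = np.zeros((2, 3))

update_single_action(q_single, state=0, action=0, target=1.0, lr=0.5)
update_all_actions(q_all, state=0, targets=[1.0, 0.0, -1.0], lr=0.5)
```

After one step, `q_single` has changed a single entry of the row for state 0, while `q_all` has moved every entry of that row toward its own target.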

Crossref

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

Novel insights into diminished cardiac reserve in non-obstructive hypertrophic cardiomyopathy from four-dimensional flow cardiac magnetic resonance component analysis

Author: Ashkir Z
Carlhäll CJ
Ebbers T
Hess A
Johnson S
Lewandowski AJ
Mahmod M
Myerson S
Neubauer S
Raman B
Watkins H
Wicks E
Publication venue: Oxford University Press
Publication date: 01/01/2023

Aims: Hypertrophic cardiomyopathy (HCM) is characterized by hypercontractility and diastolic dysfunction, which alter blood flow haemodynamics and are linked with increased risk of adverse clinical events. Four-dimensional flow cardiac magnetic resonance (4D-flow CMR) enables comprehensive characterization of ventricular blood flow patterns. We characterized flow component changes in non-obstructive HCM and assessed their relationship with phenotypic severity and sudden cardiac death (SCD) risk. Methods and results: Fifty-one participants (37 non-obstructive HCM and 14 matched controls) underwent 4D-flow CMR. Left-ventricular (LV) end-diastolic volume was separated into four components: direct flow (blood transiting the ventricle within one cycle), retained inflow (blood entering the ventricle and retained for one cycle), delayed ejection flow (retained ventricular blood ejected during systole), and residual volume (ventricular blood retained for >two cycles). Flow component distribution and component end-diastolic kinetic energy/mL were estimated. HCM patients demonstrated greater direct flow proportions compared with controls (47.9 ± 9% vs. 39.4 ± 6%, P = 0.002), with reduction in other components. Direct flow proportions correlated with LV mass index (r = 0.40, P = 0.004), end-diastolic volume index (r = −0.40, P = 0.017), and SCD risk (r = 0.34, P = 0.039). In contrast to controls, in HCM, stroke volume decreased with increasing direct flow proportions, indicating diminished volumetric reserve. There was no difference in component end-diastolic kinetic energy/mL. Conclusion: Non-obstructive HCM possesses a distinctive flow component distribution pattern characterized by greater direct flow proportions, and direct flow-stroke volume uncoupling indicative of diminished cardiac reserve. The correlation of direct flow proportion with phenotypic severity and SCD risk highlights its potential as a novel and sensitive haemodynamic measure of cardiovascular risk in HCM.

Publikationer från Linköpings universitet

Oxford University Research Archive

Digitala Vetenskapliga Arkivet - Academic Archive On-line